Reading Datasets

We have two dataset for Bottle Database in CalCOFI

bottle <-read.csv('194903-202105_Bottle.csv')
cast <- read.csv('194903-202105_Cast.csv')
getwd()
## [1] "C:/Users/gvnal/OneDrive/MasaĂĽstĂĽ"

Cast Table

Bottle Table

In each dipping process, the bottles go deeper step by step, and after reaching a certain point, another dipping process is started, which starts to go down from 0 meters to the depths, step by step, and this repeats.

Summary Statistics

summary(bottle)
##     Cst_Cnt         Btl_Cnt          Sta_ID            Depth_ID        
##  Min.   :    1   Min.   :     1   Length:895371      Length:895371     
##  1st Qu.: 8551   1st Qu.:223844   Class :character   Class :character  
##  Median :17665   Median :447686   Mode  :character   Mode  :character  
##  Mean   :17748   Mean   :447686                                        
##  3rd Qu.:27426   3rd Qu.:671529                                        
##  Max.   :35644   Max.   :895398                                        
##                                                                        
##      Depthm           T_degC          Salnty          O2ml_L       
##  Min.   :   0.0   Min.   : 1.44   Min.   : 9.50   Min.   :-200.22  
##  1st Qu.:  45.0   1st Qu.: 7.71   1st Qu.:33.48   1st Qu.:   1.37  
##  Median : 125.0   Median :10.07   Median :33.86   Median :   3.45  
##  Mean   : 224.4   Mean   :10.82   Mean   :33.83   Mean   :   3.40  
##  3rd Qu.: 300.0   3rd Qu.:13.90   3rd Qu.:34.19   3rd Qu.:   5.50  
##  Max.   :5351.0   Max.   :99.00   Max.   :37.03   Max.   :  11.13  
##                   NA's   :10969   NA's   :47356   NA's   :169741   
##      STheta           O2Sat           Oxy_.mol.Kg           BtlNum      
##  Min.   : 17.00   Min.   :-3605.22   Min.   :-8740.57   Min.   : 0.0    
##  1st Qu.: 24.96   1st Qu.:   21.30   1st Qu.:   61.77   1st Qu.: 5.0    
##  Median : 25.99   Median :   54.50   Median :  151.49   Median :10.0    
##  Mean   : 25.81   Mean   :   57.23   Mean   :  149.11   Mean   :10.4    
##  3rd Qu.: 26.64   3rd Qu.:   97.70   3rd Qu.:  240.58   3rd Qu.:16.0    
##  Max.   :250.78   Max.   :  214.10   Max.   :  485.70   Max.   :25.0    
##  NA's   :52696    NA's   :204625     NA's   :204673     NA's   :756404  
##      RecInd          T_prec           T_qual           S_prec      
##  Min.   :3.000   Min.   : 1.000   Min.   :0.0      Min.   :  1.00  
##  1st Qu.:3.000   1st Qu.: 2.000   1st Qu.:6.0      1st Qu.:  2.00  
##  Median :3.000   Median : 2.000   Median :6.0      Median :  3.00  
##  Mean   :4.692   Mean   : 2.049   Mean   :7.3      Mean   :  2.78  
##  3rd Qu.:7.000   3rd Qu.: 2.000   3rd Qu.:9.0      3rd Qu.:  3.00  
##  Max.   :7.000   Max.   :34.000   Max.   :9.0      Max.   :146.00  
##                  NA's   :10963    NA's   :871355   NA's   :47387   
##      S_qual           P_qual           O_qual           SThtaq      
##  Min.   :  4.0    Min.   :3.00     Min.   :6.0      Min.   :6.0     
##  1st Qu.:  6.0    1st Qu.:9.00     1st Qu.:9.0      1st Qu.:6.0     
##  Median :  9.0    Median :9.00     Median :9.0      Median :9.0     
##  Mean   :  9.2    Mean   :8.99     Mean   :8.6      Mean   :8.1     
##  3rd Qu.:  9.0    3rd Qu.:9.00     3rd Qu.:9.0      3rd Qu.:9.0     
##  Max.   :344.0    Max.   :9.00     Max.   :9.0      Max.   :9.0     
##  NA's   :809756   NA's   :220697   NA's   :695736   NA's   :818947  
##      O2Satq           ChlorA           Chlqua           Phaeop      
##  Min.   :2.0      Min.   : 0.0     Min.   :8        Min.   :-3.9    
##  1st Qu.:9.0      1st Qu.: 0.0     1st Qu.:9        1st Qu.: 0.0    
##  Median :9.0      Median : 0.2     Median :9        Median : 0.1    
##  Mean   :8.6      Mean   : 0.5     Mean   :9        Mean   : 0.2    
##  3rd Qu.:9.0      3rd Qu.: 0.4     3rd Qu.:9        3rd Qu.: 0.2    
##  Max.   :9.0      Max.   :66.1     Max.   :9        Max.   :65.3    
##  NA's   :662620   NA's   :649705   NA's   :245686   NA's   :649706  
##      Phaqua           PO4uM             PO4q            SiO3uM      
##  Min.   :8        Min.   :0.0      Min.   :4        Min.   :  0.0   
##  1st Qu.:9        1st Qu.:0.5      1st Qu.:9        1st Qu.:  3.1   
##  Median :9        Median :1.6      Median :9        Median : 18.0   
##  Mean   :9        Mean   :1.6      Mean   :9        Mean   : 26.5   
##  3rd Qu.:9        3rd Qu.:2.5      3rd Qu.:9        3rd Qu.: 41.6   
##  Max.   :9        Max.   :5.2      Max.   :9        Max.   :196.0   
##  NA's   :245682   NA's   :454462   NA's   :440666   NA's   :513686  
##      SiO3qu           NO2uM             NO2q            NO3uM       
##  Min.   :4        Min.   :0.0      Min.   :4        Min.   :-0.4    
##  1st Qu.:9        1st Qu.:0.0      1st Qu.:9        1st Qu.: 0.6    
##  Median :9        Median :0.0      Median :9        Median :18.1    
##  Mean   :9        Mean   :0.0      Mean   :9        Mean   :17.3    
##  3rd Qu.:9        3rd Qu.:0.0      3rd Qu.:9        3rd Qu.:30.0    
##  Max.   :9        Max.   :8.2      Max.   :9        Max.   :95.0    
##  NA's   :381514   NA's   :530929   NA's   :346291   NA's   :530373  
##       NO3q            NH3uM             NH3q           C14As1      
##  Min.   :4        Min.   : 0.0     Min.   :4.00    Min.   : -0.2   
##  1st Qu.:9        1st Qu.: 0.0     1st Qu.:9.00    1st Qu.:  0.9   
##  Median :9        Median : 0.0     Median :9.00    Median :  2.7   
##  Mean   :9        Mean   : 0.1     Mean   :8.83    Mean   :  9.9   
##  3rd Qu.:9        3rd Qu.: 0.1     3rd Qu.:9.00    3rd Qu.:  8.1   
##  Max.   :9        Max.   :33.6     Max.   :9.00    Max.   :584.5   
##  NA's   :357845   NA's   :804882   NA's   :62864   NA's   :879829  
##      C14A1p           C14A1q          C14As2           C14A2p      
##  Min.   :1.0      Min.   :8       Min.   : -0.2    Min.   :1.0     
##  1st Qu.:1.0      1st Qu.:9       1st Qu.:  0.9    1st Qu.:1.0     
##  Median :1.0      Median :9       Median :  2.6    Median :1.0     
##  Mean   :1.3      Mean   :9       Mean   :  9.9    Mean   :1.3     
##  3rd Qu.:2.0      3rd Qu.:9       3rd Qu.:  8.2    3rd Qu.:2.0     
##  Max.   :2.0      Max.   :9       Max.   :948.3    Max.   :2.0     
##  NA's   :882611   NA's   :17377   NA's   :879847   NA's   :882629  
##      C14A2q          DarkAs           DarkAp           DarkAq     
##  Min.   :8       Min.   :0.0      Min.   :1        Min.   :8      
##  1st Qu.:9       1st Qu.:0.1      1st Qu.:2        1st Qu.:9      
##  Median :9       Median :0.1      Median :2        Median :9      
##  Mean   :9       Mean   :0.2      Mean   :2        Mean   :9      
##  3rd Qu.:9       3rd Qu.:0.2      3rd Qu.:2        3rd Qu.:9      
##  Max.   :9       Max.   :9.8      Max.   :2        Max.   :9      
##  NA's   :17359   NA's   :871612   NA's   :874914   NA's   :25542  
##      MeanAs           MeanAp           MeanAq         IncTim         
##  Min.   : -0.2    Min.   :1.0      Min.   :8       Length:895371     
##  1st Qu.:  1.0    1st Qu.:1.0      1st Qu.:9       Class :character  
##  Median :  2.5    Median :1.0      Median :9       Mode  :character  
##  Mean   :  8.5    Mean   :1.3      Mean   :9                         
##  3rd Qu.:  7.1    3rd Qu.:2.0      3rd Qu.:9                         
##  Max.   :948.3    Max.   :2.0      Max.   :9                         
##  NA's   :871611   NA's   :874914   NA's   :25543                     
##      LightP          R_Depth           R_TEMP          R_Sal       
##  Min.   : 0.0     Min.   :   0.0   Min.   : 0.00   Min.   :   0.4  
##  1st Qu.: 0.2     1st Qu.:  45.0   1st Qu.: 7.76   1st Qu.: 144.1  
##  Median : 1.4     Median : 126.0   Median :10.12   Median : 203.9  
##  Mean   :17.3     Mean   : 225.9   Mean   :10.86   Mean   : 221.4  
##  3rd Qu.:21.0     3rd Qu.: 302.0   3rd Qu.:13.94   3rd Qu.: 300.2  
##  Max.   :99.9     Max.   :5458.0   Max.   :40.04   Max.   :5053.5  
##  NA's   :875604   NA's   :1        NA's   :46054   NA's   :52803   
##     R_DYNHT          R_Nuts       R_Oxy_.mol.Kg           DIC1       
##  Min.   :0.00    Min.   : 0.0     Min.   :-8740.57   Min.   :1949    
##  1st Qu.:0.13    1st Qu.: 0.0     1st Qu.:   61.77   1st Qu.:2028    
##  Median :0.34    Median : 0.0     Median :  151.49   Median :2171    
##  Mean   :0.43    Mean   : 0.1     Mean   :  149.11   Mean   :2153    
##  3rd Qu.:0.63    3rd Qu.: 0.1     3rd Qu.:  240.58   3rd Qu.:2254    
##  Max.   :3.88    Max.   :33.6     Max.   :  485.70   Max.   :2368    
##  NA's   :46688   NA's   :804862   NA's   :204673     NA's   :893372  
##       DIC2             TA1              TA2              pH1        
##  Min.   :1969     Min.   :2182     Min.   :2198     Min.   :7.6     
##  1st Qu.:2009     1st Qu.:2230     1st Qu.:2229     1st Qu.:7.9     
##  Median :2266     Median :2244     Median :2248     Median :7.9     
##  Mean   :2168     Mean   :2256     Mean   :2279     Mean   :7.9     
##  3rd Qu.:2316     3rd Qu.:2278     3rd Qu.:2316     3rd Qu.:8.0     
##  Max.   :2364     Max.   :2435     Max.   :2437     Max.   :8.0     
##  NA's   :895147   NA's   :893287   NA's   :895137   NA's   :895287  
##       pH2         DIC.Quality.Comment
##  Min.   :7.9      Length:895371      
##  1st Qu.:7.9      Class :character   
##  Median :7.9      Mode  :character   
##  Mean   :7.9                         
##  3rd Qu.:8.0                         
##  Max.   :8.0                         
##  NA's   :895361

qual, qua, or q (Quality Code - found in Bottle table; associated with discrete samples; examples: “O_qual”, “Chlqua”, “PO4q”):

We can not do na.omit because NA’s are valuable data !

bottle$O_qual.f<-factor(bottle$O_qual)
summary(bottle$O_qual.f)
##      6      8      9   NA's 
##  28439   1455 169741 695736

As we can see there is 169741 missing data and 1455 technician thinks value is suspect rows for O_qual

qualityCode<-select(bottle,ends_with(c('q','qua','_qual')))

summary(qualityCode)
##      SThtaq           O2Satq            PO4q             NO2q       
##  Min.   :6.0      Min.   :2.0      Min.   :4        Min.   :4       
##  1st Qu.:6.0      1st Qu.:9.0      1st Qu.:9        1st Qu.:9       
##  Median :9.0      Median :9.0      Median :9        Median :9       
##  Mean   :8.1      Mean   :8.6      Mean   :9        Mean   :9       
##  3rd Qu.:9.0      3rd Qu.:9.0      3rd Qu.:9        3rd Qu.:9       
##  Max.   :9.0      Max.   :9.0      Max.   :9        Max.   :9       
##  NA's   :818947   NA's   :662620   NA's   :440666   NA's   :346291  
##       NO3q             NH3q           C14A1q          C14A2q     
##  Min.   :4        Min.   :4.00    Min.   :8       Min.   :8      
##  1st Qu.:9        1st Qu.:9.00    1st Qu.:9       1st Qu.:9      
##  Median :9        Median :9.00    Median :9       Median :9      
##  Mean   :9        Mean   :8.83    Mean   :9       Mean   :9      
##  3rd Qu.:9        3rd Qu.:9.00    3rd Qu.:9       3rd Qu.:9      
##  Max.   :9        Max.   :9.00    Max.   :9       Max.   :9      
##  NA's   :357845   NA's   :62864   NA's   :17377   NA's   :17359  
##      DarkAq          MeanAq          Chlqua           Phaqua      
##  Min.   :8       Min.   :8       Min.   :8        Min.   :8       
##  1st Qu.:9       1st Qu.:9       1st Qu.:9        1st Qu.:9       
##  Median :9       Median :9       Median :9        Median :9       
##  Mean   :9       Mean   :9       Mean   :9        Mean   :9       
##  3rd Qu.:9       3rd Qu.:9       3rd Qu.:9        3rd Qu.:9       
##  Max.   :9       Max.   :9       Max.   :9        Max.   :9       
##  NA's   :25542   NA's   :25543   NA's   :245686   NA's   :245682  
##      T_qual           S_qual           P_qual           O_qual      
##  Min.   :0.0      Min.   :  4.0    Min.   :3.00     Min.   :6.0     
##  1st Qu.:6.0      1st Qu.:  6.0    1st Qu.:9.00     1st Qu.:9.0     
##  Median :6.0      Median :  9.0    Median :9.00     Median :9.0     
##  Mean   :7.3      Mean   :  9.2    Mean   :8.99     Mean   :8.6     
##  3rd Qu.:9.0      3rd Qu.:  9.0    3rd Qu.:9.00     3rd Qu.:9.0     
##  Max.   :9.0      Max.   :344.0    Max.   :9.00     Max.   :9.0     
##  NA's   :871355   NA's   :809756   NA's   :220697   NA's   :695736

From here S_qual has max 323, so we can not say every column that has qual, qua, or q is categorical variable. Table needed to inspected carefully to convert categorical variables into factor.

RecInd (Record Indicator - found in Bottle table):

Interpolated: It is the general name given to all the methods used to find/estimate the possible value at a point whose value is unknown.

bottle$RecInd.f<-factor(bottle$RecInd)
summary(bottle$RecInd.f)
##      3      4      5      6      7 
## 474987      2  81177   4202 335003

So we can say 335003 data is interpolated in this dataset and there is 474987 observed data

Predicted values are being removed from the dataset. We plan to focus on measured values rather than estimated values during analysis.

bottle<-filter(bottle,RecInd!=7)

Also wanted to remove duplicates from both cast and bottle table. “unique.data.frame” function basically removing duplicate rows from the table if exist.

bottle<-unique.data.frame(bottle)
cast<-unique.data.frame(cast)

One Variable Analysis

Wanted to count how many observations ships have had by descending arrange

orderingShipCounts <- cast %>%
  group_by(Ship_Name) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

Getting top 10 ship in figure 1, from this graph RV David Starr Jordan is the most ship used for observations

top10Data <- head(orderingShipCounts, 10)
ggplot(top10Data, aes(x = reorder(Ship_Name, -count), y = count)) +
  geom_bar(stat = "identity",fill="Orange") +
  labs(title = "Top 10 Ship Names",x = "Ship Name",y = "Count")+
  theme(axis.text.x = element_text(angle = 50, hjust = 1)) # To make labels with 50 degree we use this and for labels next to X line adding hjust=1
Figure 1:Top 10 ship names used for measurement

Figure 1:Top 10 ship names used for measurement

Well, we can draw a histogram to find out in which year the most measurements were made. In this way, we can see in which year the most measurements were made.

ggplot(cast,aes(Year))+
  geom_histogram(binwidth = 5,color='red',fill='orange')
Figure 2: Measurement counts by year

Figure 2: Measurement counts by year

From figure 2 we obtain a decreasing number of measurements since 1950. While there were over 4000 measurements in the first years, this number decreased to around 1500 measurements in the 2000s.

We can draw a bar chart to see the data types better. In this way, we can more easily observe which data type is available and how much.

Data_Type (Data Type - found in Cast table)

cast$Data_Type.f<-factor(cast$Data_Type)

ggplot(cast,aes(Data_Type.f))+
  geom_bar(fill='orange')+
  labs(x="Data Type", y="Count")
Figure 3: Data type counts

Figure 3: Data type counts

We can observe from figure 3 that hydrographic data type is the most common. We see that there is at least 10 meters of cast. We can see that there is also a lot of low resolution data here.

Also wanted to check station codes in bar graph to observe which station measured most.

Sta_Code (Station Codes - found in Cast table)

cast$Sta_Code.f<-factor(cast$Sta_Code)

ggplot(cast,aes(Sta_Code.f))+
  geom_bar(fill='orange')+
  labs(x="Stations", y="Count")
Figure 4: Data type counts

Figure 4: Data type counts

As we can see from figure 4, the most data is from the CalCofi station. However, not only calcofi station but also IMECOCAL station is in the dataset.

We can use a contingency table to find out how comprehensively the stations can search. The cast table contains the codes of the stations and the types of data. From here, we can learn which station extracted how much data and which data types were measured the most.

contTable= prop.table(table(cast$Sta_Code,cast$Data_Type))*100
contTable
##      
##                 10           CT           HY           MX           PR
##   IMX  0.002805521  3.024351925 13.275726630  0.000000000  0.064526989
##   MBR  0.000000000  0.252496914  0.819212210  0.187969925  0.022444170
##   NRO  0.002805521  1.596341600  9.019750870  0.025249691  0.078554595
##   NST  0.931433060  4.724497812 13.696554820  0.196386489  0.535854562
##   OCO  0.005611043  0.937044103  3.851980698  0.137470542  0.067332510
##   SCO  0.000000000  0.000000000  0.000000000  0.653686455  0.000000000
##   ST   0.098193244  3.007518797 19.641454382 17.489619571  5.653125351

From figure 5, we see that the most hydrographic data is provided by standard CalCOFI station, followed by “IMECOCAL” and “Non-Standard” stations with the most hydrographic data. The standard CalCOFI station ranks first in every data type. 10 meter cast is not very common in our dataset. “Mixed CTD and Bottle Data” type is provided only and largely from Standard CalCOFI Station. “MBARI” stations do not contribute much to this dataset.

Low resolution data types can be removed from the data set because they may provide incorrect information.

Two Variable Analysis

Do depth effects salinity?

model<- lm(Salnty~Depthm,bottle)

summary(model)
## 
## Call:
## lm(formula = Salnty ~ Depthm, data = bottle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4332 -0.2086  0.0414  0.2126  3.4043 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.362e+01  7.439e-04 45194.2   <2e-16 ***
## Depthm      9.368e-04  2.161e-06   433.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4423 on 535766 degrees of freedom
##   (24600 observations deleted due to missingness)
## Multiple R-squared:  0.2597, Adjusted R-squared:  0.2597 
## F-statistic: 1.879e+05 on 1 and 535766 DF,  p-value: < 2.2e-16

Adjusted R square is about 0.2597 which is not very accurate

Maybe temperature and depth effects salinity

model2<-lm(Salnty~Depthm+T_degC,bottle)

summary(model2)
## 
## Call:
## lm(formula = Salnty ~ Depthm + T_degC, data = bottle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3310 -0.1898  0.0303  0.1777  3.4057 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.389e+01  2.670e-03 12693.3   <2e-16 ***
## Depthm       7.330e-04  2.902e-06   252.6   <2e-16 ***
## T_degC      -2.040e-02  1.962e-04  -104.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.438 on 532520 degrees of freedom
##   (27845 observations deleted due to missingness)
## Multiple R-squared:  0.2745, Adjusted R-squared:  0.2745 
## F-statistic: 1.007e+05 on 2 and 532520 DF,  p-value: < 2.2e-16

Adjusted R square increased into 0.2734. Formula for salinity is salinity=3.395e+01+6.488e-04xdepth-2.416e-02xtemperature in this linear modal.

ggplot(bottle,aes(Depthm,Salnty))+
  geom_point(alpha=0.2)
Figure 6: Scatter plot for depth and salinity

Figure 6: Scatter plot for depth and salinity

From figure 6 as the depth increases, the salinity rate becomes more stable and remains constant. When we reach 4000 meters, the salinity remains constant, but at 0 meters, the salinity is quite variable. From this we can understand that salinity stabilizes as we deepen. At the same time, the cluster at the bottom of the chart may be showing us the outliers. It is useful to draw a boxplot for this, so we can visually see the outliers.

ggplot(bottle,aes(y=Salnty))+
  geom_boxplot()
Figure 7: Box plot for salinity

Figure 7: Box plot for salinity

In figure 7 can be seen, there are too many outliers. The IQR method will be applied to remove these outliers and the graph will be printed again, so we hope that the underlying cluster will disappear and provide us with a more accurate graph.

q1 <- quantile(bottle$Salnty, 0.25,na.rm=TRUE)
q3 <- quantile(bottle$Salnty, 0.75,na.rm=TRUE)

IQR<-IQR(bottle$Salnty,na.rm=TRUE)

lower_bound <- q1 - 1.5 * IQR
upper_bound <- q3 + 1.5 * IQR

bottle<-filter(bottle, Salnty>=lower_bound&Salnty<=upper_bound)

After clearing the outlines, 27289 rows are deleted from our data set. That means there were 27289 outliers.

ggplot(bottle,aes(Depthm,Salnty))+
  geom_point(alpha=0.2)
Figure 8: Scatter plot for depth and salinity

Figure 8: Scatter plot for depth and salinity

When we plot the graph again ( figure 8 ), we realize that the cluster at the bottom has disappeared, giving us a more accurate graph. At the same time, we observe here again the stabilization of salinity as we get deeper, which we mentioned at first.

summary(lm(Salnty~Depthm,bottle))
## 
## Call:
## lm(formula = Salnty ~ Depthm, data = bottle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9723 -0.2103  0.0384  0.2087  1.5879 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.362e+01  6.054e-04 55540.5   <2e-16 ***
## Depthm      9.400e-04  1.755e-06   535.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3588 on 533077 degrees of freedom
## Multiple R-squared:   0.35,  Adjusted R-squared:   0.35 
## F-statistic: 2.871e+05 on 1 and 533077 DF,  p-value: < 2.2e-16

Also adjusted R squared has increased, bottle will be used in the rest of the analysis.

We wanted to check salinity vs depth graph for each ship. To do this analysis we need to merge two table, Sta_ID is the common key for these two table so we can merge bottle and cast table together by Sta_ID

cast_aggregated <- cast %>%
  group_by(Sta_ID) %>%
  summarize(Ship_Name = first(Ship_Name))

First grouping Ship_Names by Sta_ID, so when R checking Sta_ID will be represent for ship name

merged_table <- left_join(bottle, cast_aggregated, by = 'Sta_ID')

Then merging this two table by Sta_ID, only Ship_Name is added into end of the bottle table

ggplot(bottle,aes(Depthm,Salnty))+
  geom_point()+
  facet_wrap(~merged_table$Ship_Name)
Figure 9: Scatter plot for depth and salinity seperated by ship names

Figure 9: Scatter plot for depth and salinity seperated by ship names

Using facet wrap to check salinity vs depth graph seperately by ship names From figure 9 we can say Ekvator, Ellen B. Scripps, Westwind and Yellowfin these ships did not dip the bottles any deeper during the measurement. Deepest measurement has done by RV Alexander Agassiz and Horizon

Do depth effects temperature?

ggplot(bottle,aes(Depthm,T_degC))+
  geom_point()
Figure 10: Scatter plot for depth and temperature

Figure 10: Scatter plot for depth and temperature

When we examine the graph of temperature and depth in figure 10, we observe that as we go deeper, the temperature approaches zero and decreases exponentially.

In the graph, we see an additional bulge in this exponential decrease. We plot an additional graph to observe which ships are responsible for this protrusion.

ggplot(bottle,aes(Depthm,T_degC))+
  geom_point()+
    facet_wrap(~merged_table$Ship_Name)
Figure 11: Scatter plot for depth and temperature by ship names

Figure 11: Scatter plot for depth and temperature by ship names

From figure 11, we observe that the ships RV Alexander Agassiz, Black Douglas, Spencer F. Baird and Stranger measured these salinity rates incorrectly.

Dependent Variable Explanation

Salinity

We want to see how salinity relates to other things they measure. This is really important for studying the ocean. By saying that what we are trying to solve for is salinity as the dependent variable, we can see how different things in the environment change the salinity. We will see how and to what extent molecules affect salinity with the regression model.

Temperature

To see whether molecules affect the temperature, we take the temperature as a dependent variable and consider the molecules as independent. As a result of this model, we aim to see which molecules affect the temperature and to what extent.

Wave Height

We created a model to learn what determines wave heights. In addition to the wind effect, how often waves are formed per second, atmospheric pressure, their direction, etc. we took these as an independent variables first. As a result of this model, we will learn which variable affects how much wave height.

Correlation Analysis

correlationMatrix <- cor(bottle[,c("T_degC","Salnty","Depthm","Phaeop","PO4uM","SiO3uM","NO2uM","NO3uM","NH3uM","DarkAs")], use = "complete.obs")
corrplot(correlationMatrix, method = "circle")
Figure 12: Correlation Matrix

Figure 12: Correlation Matrix

cor.test(bottle$SiO3uM,bottle$PO4uM)
## 
##  Pearson's product-moment correlation
## 
## data:  bottle$SiO3uM and bottle$PO4uM
## t = 1177.4, df = 233054, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9246615 0.9258301
## sample estimates:
##      cor 
## 0.925248
cor.test(bottle$T_degC,bottle$PO4uM)
## 
##  Pearson's product-moment correlation
## 
## data:  bottle$T_degC and bottle$PO4uM
## t = -1101.1, df = 254311, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9098530 -0.9085052
## sample estimates:
##        cor 
## -0.9091815

In figure 12, we can see the correlation between the variables. The darker the color, the more correlated the variables are. The lighter the color, the less correlated the variables are. The diagonal line shows the correlation of the variables with themselves. The correlation between the variables is not very high. The highest correlation is between SiO3uM and PO4uM. The correlation between these two variables is 0.954. Most negative correlation is PO4uM and T_degC got -0.904.

Modeling and Implementation

Salinity

SalDepth<-lm(Salnty~Depthm+ChlorA+Phaeop+PO4uM+SiO3uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)

summary(SalDepth)
## 
## Call:
## lm(formula = Salnty ~ Depthm + ChlorA + Phaeop + PO4uM + SiO3uM + 
##     NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00653 -0.09487  0.00774  0.10265  0.52308 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  3.367e+01  1.313e-02 2563.164  < 2e-16 ***
## Depthm      -1.232e-03  9.736e-05  -12.650  < 2e-16 ***
## ChlorA       6.183e-03  2.221e-03    2.784   0.0054 ** 
## Phaeop       1.215e-01  1.309e-02    9.279  < 2e-16 ***
## PO4uM       -1.257e+00  4.457e-02  -28.205  < 2e-16 ***
## SiO3uM       1.705e-02  2.153e-03    7.919 3.08e-15 ***
## NO2uM        4.262e-02  2.821e-02    1.511   0.1309    
## NO3uM        8.426e-02  3.005e-03   28.040  < 2e-16 ***
## NH3uM        8.047e-02  8.386e-03    9.596  < 2e-16 ***
## DarkAs       4.039e-02  7.190e-03    5.618 2.07e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1563 on 3992 degrees of freedom
##   (529077 observations deleted due to missingness)
## Multiple R-squared:  0.3739, Adjusted R-squared:  0.3725 
## F-statistic: 264.9 on 9 and 3992 DF,  p-value: < 2.2e-16

In first model, we used all variables to predict salinity. But our adjusted R-squared is pretty low so we decided to remove some variables and try again. When we check the significance codes of the variables we can see that NO2uM is not significant. So we decided to remove this variable and try again.

SalDepth2<-lm(Salnty~Depthm+ChlorA+Phaeop+PO4uM+SiO3uM+NO3uM+NH3uM+DarkAs,bottle)

summary(SalDepth2)
## 
## Call:
## lm(formula = Salnty ~ Depthm + ChlorA + Phaeop + PO4uM + SiO3uM + 
##     NO3uM + NH3uM + DarkAs, data = bottle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01694 -0.09501  0.00883  0.10298  0.52552 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  3.366e+01  1.281e-02 2626.826  < 2e-16 ***
## Depthm      -1.234e-03  9.736e-05  -12.675  < 2e-16 ***
## ChlorA       6.125e-03  2.221e-03    2.758  0.00585 ** 
## Phaeop       1.247e-01  1.292e-02    9.656  < 2e-16 ***
## PO4uM       -1.237e+00  4.259e-02  -29.051  < 2e-16 ***
## SiO3uM       1.669e-02  2.140e-03    7.799 7.93e-15 ***
## NO3uM        8.351e-02  2.965e-03   28.167  < 2e-16 ***
## NH3uM        8.397e-02  8.059e-03   10.420  < 2e-16 ***
## DarkAs       4.042e-02  7.191e-03    5.620 2.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1564 on 3993 degrees of freedom
##   (529077 observations deleted due to missingness)
## Multiple R-squared:  0.3735, Adjusted R-squared:  0.3723 
## F-statistic: 297.6 on 8 and 3993 DF,  p-value: < 2.2e-16

In second iteration, we removed NO2uM variable and our adjusted R-squared decreased little bit. But we can see that all variables are significant. Wanted R-squared value higher, so we decided to remove ChlorA variable and try again. The reason why ChlorA removed is that it has the highest p-value among the significant variables.

SalDepth3<-lm(Salnty~Depthm+Phaeop+PO4uM+SiO3uM+NO3uM+NH3uM+DarkAs,bottle)

summary(SalDepth3)
## 
## Call:
## lm(formula = Salnty ~ Depthm + Phaeop + PO4uM + SiO3uM + NO3uM + 
##     NH3uM + DarkAs, data = bottle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00655 -0.09422  0.00739  0.10312  0.52822 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  3.366e+01  1.280e-02 2630.348  < 2e-16 ***
## Depthm      -1.269e-03  9.664e-05  -13.126  < 2e-16 ***
## Phaeop       1.496e-01  9.263e-03   16.147  < 2e-16 ***
## PO4uM       -1.251e+00  4.234e-02  -29.537  < 2e-16 ***
## SiO3uM       1.743e-02  2.125e-03    8.205 3.09e-16 ***
## NO3uM        8.376e-02  2.966e-03   28.240  < 2e-16 ***
## NH3uM        8.264e-02  8.051e-03   10.264  < 2e-16 ***
## DarkAs       4.344e-02  7.113e-03    6.106 1.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1565 on 3994 degrees of freedom
##   (529077 observations deleted due to missingness)
## Multiple R-squared:  0.3724, Adjusted R-squared:  0.3713 
## F-statistic: 338.5 on 7 and 3994 DF,  p-value: < 2.2e-16

In third iteration, our adjusted R-squared decreased again. Need to remove another variable. We decided to remove Phaeop variable because it has the highest p-value among the significant variables. Removed DarkAs variable because it has the highest p-value among the significant variables.

SalDepth4<-lm(Salnty~Depthm+Phaeop+PO4uM+SiO3uM+NO3uM+NH3uM,bottle)
summary(SalDepth4)
## 
## Call:
## lm(formula = Salnty ~ Depthm + Phaeop + PO4uM + SiO3uM + NO3uM + 
##     NH3uM, data = bottle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92191 -0.08798 -0.00016  0.09308  0.81508 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  3.350e+01  3.341e-03 10026.60   <2e-16 ***
## Depthm      -8.037e-04  2.249e-05   -35.74   <2e-16 ***
## Phaeop       1.130e-01  3.018e-03    37.45   <2e-16 ***
## PO4uM       -8.229e-01  1.040e-02   -79.15   <2e-16 ***
## SiO3uM       3.851e-02  3.125e-04   123.25   <2e-16 ***
## NO3uM        3.927e-02  6.777e-04    57.94   <2e-16 ***
## NH3uM        4.553e-02  2.002e-03    22.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1511 on 44165 degrees of freedom
##   (488907 observations deleted due to missingness)
## Multiple R-squared:  0.7253, Adjusted R-squared:  0.7252 
## F-statistic: 1.943e+04 on 6 and 44165 DF,  p-value: < 2.2e-16

In fourth iteration, our adjusted R-squared increased to 0.7252 which means our model explains 72.52% of the variance in our dependent variable. We can see that all variables are significant. After that plotting residuals vs fitted values to check if there is pattern.

ggplot(SalDepth4, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
Figure 13: Residuals vs Fitted values

Figure 13: Residuals vs Fitted values

In figure 13, we can see that there is no pattern in residuals vs fitted values. So we can say that our model is good.

As a result, increase of depth, PO4uM, SiO3uM, NO3uM and NH3uM causes increase of salinity. Increase of Phaeop causes decrease of salinity. Highest increase is caused by PO4uM. Highest decrease is caused by Phaeop.

Temperature

TempModel<-lm(T_degC~Depthm+Salnty+O2ml_L+ChlorA+Phaeop+PO4uM+SiO3uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)
summary(TempModel)
## 
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + ChlorA + Phaeop + 
##     PO4uM + SiO3uM + NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5911 -0.6921 -0.0652  0.6074  6.5136 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.574e+01  3.709e+00  -6.942 4.50e-12 ***
## Depthm      -1.845e-02  6.519e-04 -28.310  < 2e-16 ***
## Salnty       1.703e+00  1.075e-01  15.849  < 2e-16 ***
## O2ml_L      -2.200e+00  5.402e-02 -40.718  < 2e-16 ***
## ChlorA       1.174e-02  1.514e-02   0.776    0.438    
## Phaeop      -8.199e-01  8.740e-02  -9.381  < 2e-16 ***
## PO4uM       -5.401e+00  3.202e-01 -16.869  < 2e-16 ***
## SiO3uM      -2.328e-02  1.428e-02  -1.631    0.103    
## NO2uM       -1.371e+00  1.851e-01  -7.407 1.57e-13 ***
## NO3uM       -1.244e-01  2.206e-02  -5.638 1.83e-08 ***
## NH3uM        2.317e-01  5.562e-02   4.166 3.17e-05 ***
## DarkAs       2.301e-01  4.742e-02   4.852 1.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 3987 degrees of freedom
##   (529080 observations deleted due to missingness)
## Multiple R-squared:  0.8134, Adjusted R-squared:  0.8128 
## F-statistic:  1580 on 11 and 3987 DF,  p-value: < 2.2e-16

In first model, using all possible related variables to predict temperature. But our adjusted R-squared is pretty low so we decided to remove some variables and try again. When we check the significance codes of the variables we can see that SiO3uM is not significant. So we decided to remove this variable and try again.

TempModel2<-lm(T_degC~Depthm+Salnty+O2ml_L+ChlorA+Phaeop+PO4uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)
summary(TempModel2)
## 
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + ChlorA + Phaeop + 
##     PO4uM + NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5766 -0.6977 -0.0657  0.6046  6.4734 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.521e+01  3.695e+00  -6.822 1.03e-11 ***
## Depthm      -1.838e-02  6.503e-04 -28.260  < 2e-16 ***
## Salnty       1.686e+00  1.070e-01  15.762  < 2e-16 ***
## O2ml_L      -2.192e+00  5.384e-02 -40.717  < 2e-16 ***
## ChlorA       8.415e-03  1.500e-02   0.561    0.575    
## Phaeop      -8.053e-01  8.696e-02  -9.261  < 2e-16 ***
## PO4uM       -5.569e+00  3.033e-01 -18.360  < 2e-16 ***
## NO2uM       -1.338e+00  1.840e-01  -7.271 4.28e-13 ***
## NO3uM       -1.320e-01  2.157e-02  -6.119 1.03e-09 ***
## NH3uM        2.383e-01  5.548e-02   4.295 1.79e-05 ***
## DarkAs       2.239e-01  4.728e-02   4.736 2.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 3988 degrees of freedom
##   (529080 observations deleted due to missingness)
## Multiple R-squared:  0.8132, Adjusted R-squared:  0.8128 
## F-statistic:  1737 on 10 and 3988 DF,  p-value: < 2.2e-16

In second iteration, when we check the adjusted R-squared value, we can see our adjusted R-squared stay same. But we can see that ChlorA variable is not significant. So we decided to remove this variable and check model again.

TempModel3<-lm(T_degC~Depthm+Salnty+O2ml_L+Phaeop+PO4uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)
summary(TempModel3)
## 
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + Phaeop + PO4uM + 
##     NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5567 -0.6982 -0.0627  0.6053  6.4738 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.550e+01  3.656e+00  -6.975 3.57e-12 ***
## Depthm      -1.842e-02  6.463e-04 -28.497  < 2e-16 ***
## Salnty       1.693e+00  1.061e-01  15.961  < 2e-16 ***
## O2ml_L      -2.184e+00  5.195e-02 -42.050  < 2e-16 ***
## Phaeop      -7.757e-01  6.902e-02 -11.238  < 2e-16 ***
## PO4uM       -5.575e+00  3.030e-01 -18.398  < 2e-16 ***
## NO2uM       -1.342e+00  1.838e-01  -7.302 3.41e-13 ***
## NO3uM       -1.308e-01  2.147e-02  -6.093 1.21e-09 ***
## NH3uM        2.359e-01  5.532e-02   4.265 2.05e-05 ***
## DarkAs       2.275e-01  4.683e-02   4.858 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 3989 degrees of freedom
##   (529080 observations deleted due to missingness)
## Multiple R-squared:  0.8132, Adjusted R-squared:  0.8128 
## F-statistic:  1930 on 9 and 3989 DF,  p-value: < 2.2e-16

In third iteration R-squared value still same. But wanted to try remove DarkAs variable because in salinity model when we removed DarkAs our adjusted R-squared increased a lot. Also milligrams carbon per cubic meter of seawater might not be related to temperature. So we decided to remove this variable and check model again.

TempModel4<-lm(T_degC~Depthm+Salnty+O2ml_L+Phaeop+PO4uM+NO2uM+NO3uM+NH3uM,bottle)
summary(TempModel4)
## 
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + Phaeop + PO4uM + 
##     NO2uM + NO3uM + NH3uM, data = bottle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3924  -0.5663  -0.0534   0.4511  13.4670 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.037e+01  1.025e+00  -49.15   <2e-16 ***
## Depthm      -1.409e-02  1.372e-04 -102.73   <2e-16 ***
## Salnty       2.293e+00  2.933e-02   78.18   <2e-16 ***
## O2ml_L      -1.487e+00  1.394e-02 -106.64   <2e-16 ***
## Phaeop      -8.882e-01  2.073e-02  -42.85   <2e-16 ***
## PO4uM       -3.279e+00  6.902e-02  -47.51   <2e-16 ***
## NO2uM       -1.782e+00  5.090e-02  -35.01   <2e-16 ***
## NO3uM       -2.140e-01  4.436e-03  -48.24   <2e-16 ***
## NH3uM        1.511e-01  1.291e-02   11.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9353 on 44142 degrees of freedom
##   (488928 observations deleted due to missingness)
## Multiple R-squared:  0.9022, Adjusted R-squared:  0.9022 
## F-statistic: 5.09e+04 on 8 and 44142 DF,  p-value: < 2.2e-16

After removing DarkAs variable, our adjusted R-squared increased to 0.9022. So our model pretty good. We can see that all variables are significant. After that plotting residuals vs fitted values to check if there is pattern.

ggplot(TempModel4, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
Figure 14: Residuals vs Fitted values

Figure 14: Residuals vs Fitted values

In figure 14, we can see that there is no pattern in residuals vs fitted values. So we can say that our model is good.

As a result, increase of depth, salinity, O2ml_L, Phaeop, PO4uM, NO2uM and NO3uM causes decrease of temperature. Increase of NH3uM causes increase of temperature. Highest decrease is caused by PO4uM. Highest increase is caused by salinity.

Wave Height

Wea means 1 Digit Code from The World Meteorlogical Organization, Code source WMO 4501.

  • 0 Clear(No cloud at any level)
  • 1 Partly cloudy(Scattered or broken)
  • 2 Continuous layer(s) of blowing snow
  • 3 Sandstorm, duststorm, or blowing snow
  • 4 Fog, thick dust or haze
  • 5 Drizzle
  • 6 Rain
  • 7 Snow, or rain and snow mixed
  • 8 Shower(s)
  • 9 Thunderstorm(s)
# converting date as date format. %d/%m/%Y stays for day/month/year
cast$date<-as.Date(cast$Date, format = "%d/%m/%Y")
cast$Day <- as.numeric(format(cast$date, "%d"))
cast$Month <- as.numeric(format(cast$date, "%m"))

# Transforming Wea to factor
cast$Wea<-factor(cast$Wea)

We can start writing our model. For this model we will use cast table.

Wavemodel1<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Wet_T+Barometer+Ac_Line+Ac_Sta+Secchi+Cloud_Amt+Wea+Day+Month+Year,cast)
summary(Wavemodel1)
## 
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Wet_T + 
##     Barometer + Ac_Line + Ac_Sta + Secchi + Cloud_Amt + Wea + 
##     Day + Month + Year, data = cast)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7112 -1.0929 -0.1166  0.8754 17.8671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 92.575079  12.490137   7.412 1.82e-13 ***
## Wave_Prd     0.347410   0.019901  17.457  < 2e-16 ***
## Wind_Spd     0.184925   0.006615  27.956  < 2e-16 ***
## Dry_T       -0.231620   0.043794  -5.289 1.36e-07 ***
## Wet_T        0.023468   0.039811   0.589 0.555606    
## Barometer   -0.041558   0.011135  -3.732 0.000195 ***
## Ac_Line      0.003222   0.003779   0.853 0.394003    
## Ac_Sta       0.013859   0.001599   8.668  < 2e-16 ***
## Secchi       0.014854   0.005984   2.482 0.013132 *  
## Cloud_Amt   -0.010642   0.022393  -0.475 0.634682    
## Wea1         0.154305   0.160445   0.962 0.336299    
## Wea2        -0.209566   0.220517  -0.950 0.342056    
## Wea3        -0.831516   1.238643  -0.671 0.502099    
## Wea4         0.158467   0.325142   0.487 0.626045    
## Wea5         0.006820   0.442387   0.015 0.987701    
## Wea6        -2.422808   0.489680  -4.948 8.13e-07 ***
## Wea8        -1.283431   1.748368  -0.734 0.462989    
## Day          0.026084   0.014690   1.776 0.075947 .  
## Month        0.017376   0.011395   1.525 0.127451    
## Year        -0.024468   0.002949  -8.297  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.729 on 2010 degrees of freedom
##   (33614 observations deleted due to missingness)
## Multiple R-squared:  0.4918, Adjusted R-squared:  0.487 
## F-statistic: 102.4 on 19 and 2010 DF,  p-value: < 2.2e-16

In first model, we observing that there is lots of non-significant variables. So we decided to remove one of them and try again. We decided to remove Cloud_Amt variable because it has the highest p-value among the significant variables.

Wavemodel2<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Wet_T+Barometer+Ac_Line+Ac_Sta+Secchi+Wea+Day+Month+Year,cast)
summary(Wavemodel2)
## 
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Wet_T + 
##     Barometer + Ac_Line + Ac_Sta + Secchi + Wea + Day + Month + 
##     Year, data = cast)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7493 -1.0906 -0.1107  0.8783 17.7924 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.760059  12.338713   7.923 3.78e-15 ***
## Wave_Prd     0.351043   0.019703  17.817  < 2e-16 ***
## Wind_Spd     0.184235   0.006555  28.105  < 2e-16 ***
## Dry_T       -0.233693   0.042292  -5.526 3.70e-08 ***
## Wet_T        0.022726   0.038765   0.586   0.5578    
## Barometer   -0.043828   0.011054  -3.965 7.60e-05 ***
## Ac_Line      0.001906   0.003717   0.513   0.6082    
## Ac_Sta       0.013860   0.001566   8.850  < 2e-16 ***
## Secchi       0.014478   0.005918   2.446   0.0145 *  
## Wea1         0.106688   0.128473   0.830   0.4064    
## Wea2        -0.299777   0.140573  -2.133   0.0331 *  
## Wea3        -0.894834   1.228497  -0.728   0.4665    
## Wea4         0.154849   0.269763   0.574   0.5660    
## Wea5        -0.083540   0.409371  -0.204   0.8383    
## Wea6        -2.533865   0.458737  -5.524 3.75e-08 ***
## Wea8        -1.343917   1.740985  -0.772   0.4402    
## Day          0.028017   0.014546   1.926   0.0542 .  
## Month        0.018982   0.011307   1.679   0.0934 .  
## Year        -0.025837   0.002905  -8.895  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.728 on 2037 degrees of freedom
##   (33588 observations deleted due to missingness)
## Multiple R-squared:  0.4939, Adjusted R-squared:  0.4894 
## F-statistic: 110.4 on 18 and 2037 DF,  p-value: < 2.2e-16

Our adjusted R-squared value increased little bit. But we can see that there is still non-significant variables. Ac_Line has the highest p-value among the significant variables. So we decided to remove this variable and try again.

Wavemodel3<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Wet_T+Barometer+Ac_Sta+Secchi+Wea+Day+Month+Year,cast)
summary(Wavemodel3)
## 
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Wet_T + 
##     Barometer + Ac_Sta + Secchi + Wea + Day + Month + Year, data = cast)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7447 -1.0893 -0.1052  0.8826 17.8190 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 98.818072  12.162783   8.125 7.69e-16 ***
## Wave_Prd     0.351447   0.019684  17.855  < 2e-16 ***
## Wind_Spd     0.184482   0.006536  28.224  < 2e-16 ***
## Dry_T       -0.226462   0.039864  -5.681 1.53e-08 ***
## Wet_T        0.019877   0.038358   0.518   0.6044    
## Barometer   -0.043862   0.011052  -3.969 7.48e-05 ***
## Ac_Sta       0.013644   0.001508   9.047  < 2e-16 ***
## Secchi       0.014956   0.005843   2.560   0.0106 *  
## Wea1         0.108375   0.128407   0.844   0.3988    
## Wea2        -0.294861   0.140220  -2.103   0.0356 *  
## Wea3        -0.885305   1.228134  -0.721   0.4711    
## Wea4         0.150568   0.269585   0.559   0.5766    
## Wea5        -0.076688   0.409079  -0.187   0.8513    
## Wea6        -2.523058   0.458170  -5.507 4.12e-08 ***
## Wea8        -1.340377   1.740657  -0.770   0.4414    
## Day          0.026329   0.014166   1.859   0.0632 .  
## Month        0.018813   0.011301   1.665   0.0961 .  
## Year        -0.026299   0.002761  -9.526  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.728 on 2038 degrees of freedom
##   (33588 observations deleted due to missingness)
## Multiple R-squared:  0.4938, Adjusted R-squared:  0.4896 
## F-statistic:   117 on 17 and 2038 DF,  p-value: < 2.2e-16

In third iteration, adjusted R-squared stays same but some variables still needed to removed. Removing Wet_T and checking new model.

Wavemodel4<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Barometer+Ac_Sta+Secchi+Wea+Day+Month+Year,cast)
summary(Wavemodel4)
## 
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Barometer + 
##     Ac_Sta + Secchi + Wea + Day + Month + Year, data = cast)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7623 -1.0993 -0.1099  0.8817 17.8223 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.052997  12.152151   8.151 6.23e-16 ***
## Wave_Prd     0.351329   0.019679  17.853  < 2e-16 ***
## Wind_Spd     0.184435   0.006535  28.225  < 2e-16 ***
## Dry_T       -0.208691   0.020322 -10.269  < 2e-16 ***
## Barometer   -0.044383   0.011004  -4.033 5.70e-05 ***
## Ac_Sta       0.013636   0.001508   9.044  < 2e-16 ***
## Secchi       0.014944   0.005842   2.558   0.0106 *  
## Wea1         0.107200   0.128364   0.835   0.4037    
## Wea2        -0.289949   0.139875  -2.073   0.0383 *  
## Wea3        -0.876692   1.227801  -0.714   0.4753    
## Wea4         0.173740   0.265803   0.654   0.5134    
## Wea5        -0.052051   0.406234  -0.128   0.8981    
## Wea6        -2.507660   0.457123  -5.486 4.63e-08 ***
## Wea8        -1.353820   1.740151  -0.778   0.4367    
## Day          0.027150   0.014075   1.929   0.0539 .  
## Month        0.018715   0.011297   1.657   0.0977 .  
## Year        -0.026156   0.002746  -9.524  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.727 on 2039 degrees of freedom
##   (33588 observations deleted due to missingness)
## Multiple R-squared:  0.4938, Adjusted R-squared:  0.4898 
## F-statistic: 124.3 on 16 and 2039 DF,  p-value: < 2.2e-16

Our adjusted R-squared value increasing little by little. Continueing to remove non-significant variables.

Wavemodel5<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Barometer+Ac_Sta+Secchi+Wea+Day+Year,cast)
summary(Wavemodel5)
## 
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Barometer + 
##     Ac_Sta + Secchi + Wea + Day + Year, data = cast)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7901 -1.0941 -0.1138  0.8904 17.7706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.976438  12.139951   8.071 1.18e-15 ***
## Wave_Prd     0.350638   0.019683  17.814  < 2e-16 ***
## Wind_Spd     0.184407   0.006537  28.208  < 2e-16 ***
## Dry_T       -0.207761   0.020323 -10.223  < 2e-16 ***
## Barometer   -0.043896   0.011005  -3.989 6.88e-05 ***
## Ac_Sta       0.013573   0.001508   9.001  < 2e-16 ***
## Secchi       0.014875   0.005845   2.545   0.0110 *  
## Wea1         0.106712   0.128419   0.831   0.4061    
## Wea2        -0.288913   0.139933  -2.065   0.0391 *  
## Wea3        -0.883000   1.228320  -0.719   0.4723    
## Wea4         0.186126   0.265812   0.700   0.4839    
## Wea5        -0.037517   0.406312  -0.092   0.9264    
## Wea6        -2.481160   0.457039  -5.429 6.35e-08 ***
## Wea8        -1.387281   1.740778  -0.797   0.4256    
## Day          0.027306   0.014080   1.939   0.0526 .  
## Year        -0.025803   0.002739  -9.420  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.728 on 2040 degrees of freedom
##   (33588 observations deleted due to missingness)
## Multiple R-squared:  0.4931, Adjusted R-squared:  0.4894 
## F-statistic: 132.3 on 15 and 2040 DF,  p-value: < 2.2e-16

Removing Day variable and checking new model.

Wavemodel6<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Barometer+Ac_Sta+Secchi+Wea+Year,cast)
summary(Wavemodel6)
## 
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Barometer + 
##     Ac_Sta + Secchi + Wea + Year, data = cast)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.865 -1.125 -0.167  0.907 35.330 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.069e+02  7.797e+00  13.704  < 2e-16 ***
## Wave_Prd     3.505e-01  1.352e-02  25.917  < 2e-16 ***
## Wind_Spd     1.806e-01  4.308e-03  41.912  < 2e-16 ***
## Dry_T       -1.828e-01  1.162e-02 -15.729  < 2e-16 ***
## Barometer   -5.033e-02  7.053e-03  -7.136 1.10e-12 ***
## Ac_Sta       1.208e-02  9.823e-04  12.293  < 2e-16 ***
## Secchi       1.942e-02  3.664e-03   5.301 1.20e-07 ***
## Wea1         1.290e-02  7.883e-02   0.164  0.87001    
## Wea2        -2.409e-01  8.622e-02  -2.794  0.00522 ** 
## Wea3        -8.513e-01  1.276e+00  -0.667  0.50473    
## Wea4         1.887e-02  1.641e-01   0.115  0.90844    
## Wea5        -2.008e-01  2.707e-01  -0.742  0.45831    
## Wea6        -1.505e+00  3.050e-01  -4.935 8.29e-07 ***
## Wea8        -2.133e-01  9.056e-01  -0.236  0.81383    
## Year        -2.707e-02  1.760e-03 -15.374  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.802 on 5010 degrees of freedom
##   (30619 observations deleted due to missingness)
## Multiple R-squared:  0.4464, Adjusted R-squared:  0.4448 
## F-statistic: 288.5 on 14 and 5010 DF,  p-value: < 2.2e-16

So this iteration seems the best model. We can’t remove wea because wea6 is very significant. Other variables are also significant. So we can say that our model is good. We can check residuals vs fitted values to check if there is pattern.

ggplot(Wavemodel6, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")
Figure 15: Residuals vs Fitted values

Figure 15: Residuals vs Fitted values

In figure 15, we can say that there is pattern pattern in residuals vs fitted values. So we can not say that our model is strong. But we wanted to check Q-Q residuals plot to check if residuals are normally distributed.

ggplot(Wavemodel6, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()
Figure 16: Q-Q residuals plot

Figure 16: Q-Q residuals plot

In figure 16, we can see that residuals are normally distributed. Q-Q plot can be helpful in understanding how well the errors fit a normal distribution. The absence of a curved line could mean that the errors do not follow a normal distribution and that the model may make poor predictions.